Data Stream Mining

Question 1

For the following graph:

   C -- D -- E
 / |    |    | \
A  |    |    |  B
 \ |    |    | /
   F -- G -- H

Write the adjacency matrix A, the degree matrix D, and the Laplacian matrix L. For each, find the sum of all entries and the number of nonzero entries.
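
Reading the edges off the drawing gives A-C, A-F, B-E, B-H, C-D, C-F, D-E, D-G, E-H, F-G, and G-H (11 edges in total). As a quick sanity-check sketch, the hand-entered adjacency matrix in the cells below can be rebuilt from this edge list (the names A_check, idx, nodes, and edges are just illustrative):

import numpy as np

nodes = list("ABCDEFGH")
edges = [("A","C"), ("A","F"), ("B","E"), ("B","H"), ("C","D"), ("C","F"),
         ("D","E"), ("D","G"), ("E","H"), ("F","G"), ("G","H")]
idx = {v: i for i, v in enumerate(nodes)}      # vertex name -> row/column index
A_check = np.zeros((8, 8), dtype=int)
for u, v in edges:
    A_check[idx[u], idx[v]] = A_check[idx[v], idx[u]] = 1   # undirected edge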


In [1]:
import numpy as np

In [2]:
A = np.array([#A B C D E F G H
             [0,0,1,0,0,1,0,0],
             [0,0,0,0,1,0,0,1],
             [1,0,0,1,0,1,0,0],
             [0,0,1,0,1,0,1,0],
             [0,1,0,1,0,0,0,1],
             [1,0,1,0,0,0,1,0],
             [0,0,0,1,0,1,0,1],
             [0,1,0,0,1,0,1,0]])

print A


[[0 0 1 0 0 1 0 0]
 [0 0 0 0 1 0 0 1]
 [1 0 0 1 0 1 0 0]
 [0 0 1 0 1 0 1 0]
 [0 1 0 1 0 0 0 1]
 [1 0 1 0 0 0 1 0]
 [0 0 0 1 0 1 0 1]
 [0 1 0 0 1 0 1 0]]

In [3]:
D = np.sum(A, 0)   # vertex degrees (column sums of A)
D = np.diag(D)     # degree matrix: degrees on the diagonal
print D


[[2 0 0 0 0 0 0 0]
 [0 2 0 0 0 0 0 0]
 [0 0 3 0 0 0 0 0]
 [0 0 0 3 0 0 0 0]
 [0 0 0 0 3 0 0 0]
 [0 0 0 0 0 3 0 0]
 [0 0 0 0 0 0 3 0]
 [0 0 0 0 0 0 0 3]]

In [4]:
L = D - A   # (unnormalized) graph Laplacian
print L


[[ 2  0 -1  0  0 -1  0  0]
 [ 0  2  0  0 -1  0  0 -1]
 [-1  0  3 -1  0 -1  0  0]
 [ 0  0 -1  3 -1  0 -1  0]
 [ 0 -1  0 -1  3  0  0 -1]
 [-1  0 -1  0  0  3 -1  0]
 [ 0  0  0 -1  0 -1  3 -1]
 [ 0 -1  0  0 -1  0 -1  3]]

In [5]:
for matrix, tag in zip((A, D, L), ("Adjacency A", "Degree D", "Laplacian L")):
    print "For {}, Entry sum:".format(tag), np.sum(matrix), "\t", "Non-zero entries: ", np.sum(np.where(matrix != 0, 1, 0))


For Adjacency A, Entry sum: 22 	Non-zero entries:  22
For Degree D, Entry sum: 22 	Non-zero entries:  8
For Laplacian L, Entry sum: 0 	Non-zero entries:  30
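
These counts follow from the structure of the matrices: A has one nonzero per edge endpoint (2 * 11 = 22), D has one nonzero per vertex (8), and L combines the 8 diagonal degrees with the 22 off-diagonal entries (30 nonzeros) while every row sums to zero. A quick check of the defining properties, reusing A, D, and L from the cells above (a sketch, not part of the original assignment):

assert np.array_equal(L, L.T)                    # the Laplacian is symmetric
assert np.all(np.sum(L, axis=1) == 0)            # every row of D - A sums to zero
assert np.all(np.diag(D) == np.sum(A, axis=0))   # degrees match the column sums of A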

Question 2

We wish to estimate the surprise number (2nd moment) of a data stream, using the method of AMS. It happens that our stream consists of ten different values, which we'll call 1, 2,..., 10, that cycle repeatedly. That is, at timestamps 1 through 10, the element of the stream equals the timestamp; at timestamps 11 through 20, the element is the timestamp minus 10; and so on. It is now timestamp 75, and a 5 has just been read from the stream. As a start, you should calculate the surprise number for this time.
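
(At timestamp 75, elements 1 through 5 have each appeared 8 times and elements 6 through 10 have appeared 7 times, so the surprise number is 5*8^2 + 5*7^2 = 320 + 245 = 565, matching the value computed below.)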

For our estimate of the surprise number, we shall choose three timestamps at random, and estimate the surprise number from each, using the AMS approach (length of the stream times 2m-1, where m is the number of occurrences of the element of the stream at that timestamp, considering all times from that timestamp on, to the current time). Then, our estimate will be the median of the three resulting values.
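
For example, the element at timestamp 20 is 10; it occurs again at timestamps 30, 40, 50, 60, and 70, so m = 6 and the estimate from that timestamp is 75 * (2*6 - 1) = 825.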

You should discover the simple rules that determine the estimate derived from any given timestamp and from any set of three timestamps. Then, identify which of the following sets of three "random" timestamps gives the estimate closest to the true surprise number: {20, 49, 53}, {17, 43, 51}, {25, 34, 47}, {37, 46, 55}.


In [6]:
import json
import pprint

def surprise(seq):
    # Count the occurrences of each element, then sum the squared counts
    # (the 2nd moment, i.e., the surprise number of the stream).
    counts = {}
    for n in seq:
        counts[n] = counts.get(n, 0) + 1
    pprint.pprint(counts)
    return sum([v*v for k,v in counts.items()])

In [7]:
triples = [[20, 49, 53],
           [17, 43, 51],
           [25, 34, 47],
           [37, 46, 55]]

In [8]:
def median(triple):
    return sorted(triple)[len(triple)>>1]

In [9]:
def estimate(r, s):
    # For each chosen timestamp x (1-based), m is the number of occurrences of
    # the element s[x-1] from that timestamp to the end of the stream; the AMS
    # estimate from that timestamp is then len(s) * (2m - 1).
    ms = [sum([e == s[x-1] and p >= (x-1) for p, e in enumerate(s)]) for x in r]
    return [len(s) * (2*m - 1) for m in ms]

In [10]:
def main():
    s = [(i % 10) + 1 for i in range(0, 75)]
    print surprise(s)
    
    for a in triples:
        print '%s => %d' % (json.dumps(a), median(estimate(a, s)))
    
    return 0

In [11]:
if __name__ == '__main__':
    main()


{1: 8, 2: 8, 3: 8, 4: 8, 5: 8, 6: 7, 7: 7, 8: 7, 9: 7, 10: 7}
565
[20, 49, 53] => 375
[17, 43, 51] => 525
[25, 34, 47] => 675
[37, 46, 55] => 375
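
The true surprise number is 565, so the closest estimate comes from {17, 43, 51}, whose median estimate is 525.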

Question 3

We wish to use the Flajolet-Martin algorithm to count the number of distinct elements in a stream. Suppose that there are ten possible elements, 1, 2,..., 10, that could appear in the stream, but only four of them have actually appeared. To make our estimate of the count of distinct elements, we hash each element to a 4-bit binary number. The element x is hashed to 3x + 7 (modulo 11). For example, element 8 hashes to 3*8 + 7 = 31, which is 9 modulo 11 (i.e., the remainder of 31/11 is 9). Thus, the 4-bit string for element 8 is 1001.

A set of four of the elements 1 through 10 could give an estimate that is exact (if the estimate is 4), or too high, or too low. You should figure out under what circumstances a set of four elements falls into each of those categories. Then, identify which of the following sets of four elements gives the exactly correct estimate: {3, 4, 8, 10}, {1, 2, 3, 9}, {4, 5, 6, 7}, {3, 7, 8, 10}.
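
Recall that Flajolet-Martin estimates the number of distinct elements as 2^R, where R is the largest tail length (number of trailing zero bits) among the hashed values; the cells below hash each set, take the maximum tail length, and print that estimate.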


In [12]:
import json

streams = [[3, 4, 8, 10],
           [1, 2, 3, 9],
           [4, 5, 6, 7],
           [3, 7, 8, 10]]

In [13]:
def mod(x):
    # Hash function h(x) = 3x + 7 (mod 11), giving a 4-bit value in 0..10
    return (3*x + 7) % 11

In [14]:
def bit(x):
    # Tail length of the 4-bit hash value: the number of trailing zero bits
    # (a hash of 0000 is treated as a tail of length 4).
    m = [0xf, 0x7, 0x3, 0x1]
    for i in range(0, 4):
        if m[i] & x == 0:
            return 4-i
    return 0

In [15]:
def main():
    for stream in streams:
        # R = longest tail (most trailing zeros) among the hashed values
        r = max([bit(mod(x)) for x in stream])
        # Flajolet-Martin estimate of the number of distinct elements: 2^R
        print '%s = %d' % (json.dumps(stream), 2**r)

In [16]:
if __name__ == '__main__':
    main()


[3, 4, 8, 10] = 8
[1, 2, 3, 9] = 2
[4, 5, 6, 7] = 16
[3, 7, 8, 10] = 4

Correct answer: {3, 7, 8, 10}, i.e., estimate = 4.
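
The hashes of 3, 7, 8, and 10 are 5 (0101), 6 (0110), 9 (1001), and 4 (0100), with tail lengths 0, 1, 0, and 2, so R = 2 and the estimate 2^2 = 4 equals the true number of distinct elements.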

Question 4

A certain Web mail service has 10^8 users and wishes to create a sample of data about these users, occupying 10^10 bytes. Activity at the service can be viewed as a stream of elements, each of which is an email. The element contains the ID of the sender, which must be one of the 10^8 users of the service, and other information, e.g., the recipient(s) and contents of the message. The plan is to pick a subset of the users and collect, within the 10^10 bytes, records of length 100 bytes about every email sent by the users in the selected set (and nothing about other users).

User IDs will be hashed to a bucket number, from 0 to 999,999. At all times, there will be a threshold t such that the 100-byte records for all the users whose IDs hash to t or less will be retained, and other users' records will not be retained. You may assume that each user generates emails at exactly the same rate as the other users. As a function of n, the number of emails in the stream so far, what should the threshold t be so that the selected records do not exceed the 10^10 bytes available to store records? Identify the value of n and its corresponding value of t.
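
Since each user sends the same fraction of the n emails and a user's records are kept when its ID hashes to one of the t + 1 buckets 0, ..., t, the expected number of retained records is n * (t + 1) / 10^6. The store holds 10^10 / 100 = 10^8 records, so we need n * (t + 1) / 10^6 <= 10^8, i.e., t <= 10^14/n - 1; the function below takes t as large as possible.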


In [17]:
def email(n):
    # Largest t with n * (t + 1) / 10^6 <= 10^8 retained records
    # (10^10 bytes of storage / 100 bytes per record)
    t = (10**14)/n - 1
    return t

print "t-value when n=10^9: ", email(10**9)
print "t-value when n=10^10: ", email(10**10)
print "t-value when n=10^11: ", email(10**11)
print "t-value when n=10^12: ", email(10**12)
print "t-value when n=10^13: ", email(10**13)
print "t-value when n=10^14: ", email(10**14)


t-value when n=10^9:  99999
t-value when n=10^10:  9999
t-value when n=10^11:  999
t-value when n=10^12:  99
t-value when n=10^13:  9
t-value when n=10^14:  0
